Abstract: This paper studies Bayesian linear regression in
data mining. Statistical methods differ substantially from the
newer trend of data mining. From a statistical perspective, data
mining can be viewed as computer-automated exploratory data
analysis of large, complex data sets. Despite the obvious
connections between data mining and statistical data analysis,
most of the methodologies used in data mining have so far
originated in fields other than statistics.


Index Terms: Ordinary Least Square Method, Bayesian Linear
Regression Method, Ordinary Least Square Estimation,
Conjugate Prior Distribution, Posterior Predictive Distribution.


I. Introduction 

Data mining (DM) is at best a vaguely defined field; its
definition largely depends on the background and views of
the definer, and this dependence is itself one of its main
characteristics. From the pattern recognition viewpoint, data
mining is the nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable patterns in
data. From the database viewpoint, data mining is the process
of extracting previously unknown, comprehensible, and
actionable information from large databases and using it to
make crucial business decisions. From the machine learning
viewpoint, data mining is a set of methods used in the
knowledge discovery process to distinguish previously unknown
relationships and patterns within data. There are many
different techniques for data mining. The choice of technique
is often determined by the type of data available and the type
of information to be extracted from it. The most popular data
mining methods in current use are classification, clustering,
neural networks, association, sequence-based analysis,
estimation, and visualization.
Statistics, like data mining, is concerned with turning data
into information and knowledge, even though the terminology
may differ. In this section we present a major statistical
approach used in data mining, namely regression analysis. In
the late 1990s, statistical methodologies such as regression
analysis were not included in commercial data mining
packages. Nowadays, most commercial data mining software
includes many statistical tools, regression analysis in
particular. Although regression analysis may seem simple and
anachronistic, it is a very powerful tool in DM with large
data sets, especially in the form of generalized linear models
(GLMs). We emphasize the assumptions of the models being used
and how the underlying approach differs from that of machine
learning.

Bayesian inference is a method of statistical inference in
which Bayes' theorem is used to update the probability for a
hypothesis as evidence is acquired. Bayesian inference is an
important technique in statistics, and especially in
mathematical statistics. Bayesian updating is particularly
important in the dynamic analysis of a sequence of data.

Bayesian inference has found application in a wide range 
of activities, including science, engineering, philosophy, 
medicine, and law. In the philosophy of decision theory, 
Bayesian inference is closely related to subjective probability, 
often called "Bayesian probability". The components of 
Bayesian statistical inference consist of the prior information, 
the sample data, calculation of the posterior density of the 
parameters and sometimes calculation of the predictive 
distribution of future observations. 
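
As a compact summary of these components, they can be written as two formulas. The symbols are generic and are assumed here only for illustration: $\theta$ stands for the parameters, $y$ for the sample data, and $\tilde{y}$ for a future observation that is taken to be conditionally independent of $y$ given $\theta$.

```latex
\begin{align*}
p(\theta \mid y) &= \frac{p(y \mid \theta)\, p(\theta)}
                        {\int p(y \mid \theta)\, p(\theta)\, d\theta}
  && \text{(posterior density: likelihood times prior, normalized)} \\
p(\tilde{y} \mid y) &= \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta
  && \text{(predictive distribution of a future observation)}
\end{align*}
```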

From the posterior density one may make inferences for $\beta$
by examining the posterior density. Some prefer to give
estimates of $\beta$, either point or interval estimates, which
are computed from the posterior distribution. If $\beta$ is
one-dimensional, a plot of its posterior density tells one the
story about $\beta$, but if $\beta$ is multidimensional one must
be able to isolate those components of $\beta$ one is interested
in. The following diagram illustrates the method of Bayesian
inference.


Algorithm and Procedure of OLS and Bayesian Linear 
Regression method: 


Ordinary Least Square Method:

Step 1: Arrange the given observations in ascending order.

Step 2: Find the normal equation.

Step 3: Solve the normal equation and get the unknown
parameter value $\hat{\beta}$:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

Step 4: Substitute the $\hat{\beta}$ value and form the regression
equation.

Step 5: For prediction, substitute the different values of X
into the regression equation to predict y.
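
The five OLS steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the toy data, the single-predictor setup, and the helper name `predict` are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy data: 6 observations of one predictor x and a response y.
x = np.array([3.0, 1.0, 5.0, 2.0, 4.0, 6.0])
y = np.array([7.1, 2.9, 11.2, 5.0, 9.1, 12.8])

# Step 1: arrange the observations in ascending order of x.
order = np.argsort(x)
x, y = x[order], y[order]

# Step 2: form the normal equations (X^T X) beta = X^T y,
# with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])
XtX = X.T @ X
Xty = X.T @ y

# Step 3: solve the normal equations for beta_hat.
beta_hat = np.linalg.solve(XtX, Xty)

# Step 4: the fitted regression equation is y = beta_hat[0] + beta_hat[1] * x.
def predict(x_new):
    x_new = np.atleast_1d(np.asarray(x_new, dtype=float))
    return beta_hat[0] + beta_hat[1] * x_new

# Step 5: predict at new x values.
y_new = predict([7.0, 8.0])
print("beta_hat:", beta_hat)
print("predictions:", y_new)
```

Sorting in Step 1 does not change the estimate, because the normal equations only involve sums over the observations; it is kept here to mirror the procedure as stated.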

Bayesian Linear Regression Method: 

Step 1: Find the joint posterior from the prior and the
likelihood function.

Step 2: From the joint posterior, find the marginal posterior
distribution.

Step 3: Find the conditional distribution of the unknown
parameter $\beta$ given the observed variables.

Step 4: Find the mean of the posterior distribution. The mean
of the posterior distribution is then taken as the estimate
$\hat{\beta}$ of the unknown parameter.

Step 5: For prediction, substitute the different values of X and
y in the posterior predictive distribution.


Linear Regression Model

We consider a random variable $y$ as a function of a
(typically non-random) vector-valued variable $x \in \mathbb{R}^k$.
This is modeled as a linear relationship, with coefficients
$\beta_j$ and i.i.d. Gaussian random noise,

$$y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \varepsilon_i,$$

where $\varepsilon_i \sim N(0, \sigma^2)$, that is, each noise term
has mean zero and variance $\sigma^2$.

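In matrix form, this looks like

$$\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} x_{11} & \cdots & x_{1k} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nk} \end{pmatrix}
\begin{pmatrix} \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} +
\begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}.$$

Simply,

$$y = X\beta + \varepsilon. \qquad (1)$$

It is common to set the first column of $X$ to a constant column
of 1's, so that $\beta_1$ is an intercept term. In equation (1),
$\beta$ is the unknown parameter. So, we find the unknown
parameter value $\beta$ using the method of "Ordinary Least
Square" (OLS).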
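
To make equation (1) concrete, the following sketch simulates data from the linear model with an intercept column. The sample size, coefficient values, and noise level are hypothetical and are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n = 100                                  # number of observations (assumed)
beta_true = np.array([1.0, 2.0, -0.5])   # hypothetical coefficients; beta_true[0] is the intercept
sigma = 0.3                              # hypothetical noise standard deviation

# Design matrix: first column of ones (intercept), then two predictors.
predictors = rng.uniform(-1.0, 1.0, size=(n, 2))
X = np.column_stack([np.ones(n), predictors])

# i.i.d. Gaussian noise with mean zero and variance sigma^2.
eps = rng.normal(loc=0.0, scale=sigma, size=n)

# Equation (1): y = X beta + eps.
y = X @ beta_true + eps

print(X.shape, y.shape)
```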

Ordinary Least Square Estimation 

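Our aim is to estimate the unknown parameter $\beta$. The
MLE of $\beta$ is based on the Gaussian likelihood,

$$P(y \mid X, \beta; \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2\right].$$

This is the product of the likelihoods for each of the individual
components $y = [y_1, y_2, \ldots, y_n]$.

Therefore, we take the log of this likelihood,

$$\log P(y \mid X, \beta; \sigma^2) = \log\left[\frac{1}{(2\pi\sigma^2)^{n/2}}\right] + \log \exp\left[-\frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2\right]$$

$$\log P(y \mid X, \beta; \sigma^2) = \log 1 - \log\left[(2\pi\sigma^2)^{n/2}\right] - \frac{1}{2\sigma^2}(y - X\beta)^T (y - X\beta)$$

$$\log P(y \mid X, \beta; \sigma^2) = 0 - \frac{n}{2}\log\left[2\pi\sigma^2\right] - \frac{1}{2\sigma^2}\left[y^T y - y^T X\beta - \beta^T X^T y + \beta^T X^T X\beta\right]$$

$$\log P(y \mid X, \beta; \sigma^2) = -\frac{n}{2}\log\left[2\pi\sigma^2\right] - \frac{1}{2\sigma^2}\left[y^T y - 2\beta^T X^T y + \beta^T X^T X\beta\right] \qquad (2)$$

Differentiating equation (2) with respect to $\beta$ and setting
the derivative to zero,

$$\frac{\partial \log P}{\partial \beta} = 0$$

$$-0 - \frac{1}{2\sigma^2}\left[-2X^T y + 2X^T X\beta\right] = 0$$

$$-\frac{2}{2\sigma^2}\left[-X^T y + X^T X\beta\right] = 0$$

$$-\frac{1}{\sigma^2}\left[-X^T y + X^T X\beta\right] = 0$$

$$-X^T y + X^T X\beta = 0$$

$$X^T X\beta = X^T y$$

$$\hat{\beta} = (X^T X)^{-1} X^T y \qquad (3)$$

The inverse here is the Moore-Penrose pseudoinverse (the same
as the ordinary inverse when it exists). The MLE for $\beta$ is
Gaussian distributed and unbiased,

$$\hat{\beta} \sim N\left(\beta, \sigma^2 (X^T X)^{-1}\right).$$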
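
The claim that the estimator in equation (3) is unbiased with covariance $\sigma^2 (X^T X)^{-1}$ can be checked numerically. The sketch below is a small Monte Carlo experiment under assumed settings (sample size, true coefficients, and noise level are hypothetical): it draws many data sets from the same fixed design, applies equation (3) to each, and compares the empirical mean and covariance of the estimates with the theoretical values.

```python
import numpy as np

rng = np.random.default_rng(1)

n, reps = 50, 5000                    # sample size and number of simulated data sets (assumed)
beta_true = np.array([1.0, 2.0])      # hypothetical intercept and slope
sigma = 0.5                           # hypothetical noise standard deviation

# Fixed (non-random) design with an intercept column.
x = np.linspace(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
XtX = X.T @ X

# Simulate `reps` response vectors at once: each column of Y is one data set.
noise = rng.normal(0.0, sigma, size=(n, reps))
Y = (X @ beta_true)[:, None] + noise

# Equation (3) applied column-wise: solve (X^T X) B = X^T Y.
B = np.linalg.solve(XtX, X.T @ Y)     # shape (2, reps)

empirical_mean = B.mean(axis=1)
empirical_cov = np.cov(B)             # rows are variables, columns are replications
theoretical_cov = sigma**2 * np.linalg.inv(XtX)

print("mean of beta_hat:", empirical_mean)
print("empirical covariance:\n", empirical_cov)
print("sigma^2 (X^T X)^{-1}:\n", theoretical_cov)
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly; when $X^T X$ is singular, `np.linalg.pinv(X) @ y` gives the pseudoinverse solution mentioned above.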

Bayesian Linear Regression 

In statistics, Bayesian linear regression is an approach to
linear regression in which the statistical analysis is
undertaken within the context of Bayesian inference. When the
regression model has errors that have a normal distribution,
and if a particular form of prior distribution is assumed,
explicit results are available for the posterior probability
distributions of the model's parameters.

The MLE $\hat{\beta}$ can be overinflated for higher-order
coefficients, as the model tries to overfit the data with a
"wiggly" curve. To counteract this, we may inject our prior
belief that these coefficients should not be so large. So, we
introduce a conjugate Gaussian prior, $\beta \sim N(0, \Lambda^{-1})$.
Here we are parameterizing the Gaussian using the inverse
covariance, or precision matrix, $\Lambda$, which will make
computations easier. A common choice is $\Lambda = \lambda I$,
for a positive scalar parameter $\lambda$.

Conjugate Prior Distribution 

For an arbitrary prior distribution, there may be no
analytical solution for the posterior distribution. In this
section, we consider a so-called conjugate prior, for which
the posterior distribution can be derived analytically.

Posterior Distribution

Then, the posterior for $\beta$ is

$$P(\beta \mid y; X, \sigma^2, \Lambda) \propto P(y \mid X, \beta; \sigma^2)\, P(\beta; \Lambda) \propto \exp\left[-\frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2 - \frac{1}{2}\beta^T \Lambda \beta\right].$$

Just as in the univariate case, the conjugate prior for $\beta$
results in the posterior also being a multivariate Gaussian.
Completing the square inside the exponent, we see that the
posterior for $\beta$ has the following distribution,

$$\beta \mid y; X, \sigma^2, \Lambda \sim N(\mu_n, \Sigma_n),$$

where the estimate of $\beta$ is $\hat{\beta} = \mu_n$, the mean of
the posterior distribution, and

$$\mu_n = (X^T X + \sigma^2 \Lambda)^{-1} X^T y,$$

$$\Sigma_n = \sigma^2 (X^T X + \sigma^2 \Lambda)^{-1}.$$
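
The posterior mean and covariance can be computed directly from these two closed-form expressions. The following is a minimal sketch that assumes the noise variance $\sigma^2$ is known and uses the prior $\Lambda = \lambda I$; the function name `posterior`, the simulated data, and the chosen values of $\lambda$ are hypothetical.

```python
import numpy as np

def posterior(X, y, sigma2, Lambda):
    """Posterior mean mu_n and covariance Sigma_n of beta under a N(0, Lambda^{-1}) prior."""
    A = X.T @ X + sigma2 * Lambda        # X^T X + sigma^2 Lambda
    mu_n = np.linalg.solve(A, X.T @ y)   # (X^T X + sigma^2 Lambda)^{-1} X^T y
    Sigma_n = sigma2 * np.linalg.inv(A)  # sigma^2 (X^T X + sigma^2 Lambda)^{-1}
    return mu_n, Sigma_n

rng = np.random.default_rng(2)
n, sigma2 = 30, 0.25                     # hypothetical sample size and known noise variance
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, np.sqrt(sigma2), size=n)

# OLS estimate for comparison.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

for lam in (1e-8, 1.0, 100.0):           # Lambda = lam * I
    mu_n, Sigma_n = posterior(X, y, sigma2, lam * np.eye(X.shape[1]))
    print(f"lambda={lam:g}  mu_n={mu_n}  OLS={beta_ols}")
```

As $\lambda$ approaches zero the prior becomes flat and $\mu_n$ approaches the OLS estimate, while larger $\lambda$ shrinks $\mu_n$ toward the prior mean of zero. With $\Lambda = \lambda I$ the intercept is shrunk along with the other coefficients.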


Posterior Predictive Distribution 

In statistics, and especially Bayesian statistics, the
posterior predictive distribution is the distribution of
unobserved observations (predictions) conditional on the
observed data. It can be described as the distribution that a
new i.i.d. data point would have, given a set of N existing
i.i.d. observations. In a frequentist context, this might be
derived
by computing the maximum likelihood estimate (or some 
other estimate) of the parameter(s) given the observed data, 
and then plugging them into the distribution function of the 
new observations. However, the concept of posterior 
predictive distribution is normally used in a Bayesian context, 
where it makes use of the entire posterior distribution of the 
parameter(s) given the observed data to yield a probability 
distribution over an interval rather than simply a point 
estimate. Specifically, it is computed by marginalizing over 
the parameters, using the posterior distribution. 

Now, say we are given a new independent data point $\tilde{x}$,
and we would like to predict the corresponding unseen dependent
value, $\tilde{y}$. The posterior predictive distribution of
$\tilde{y}$ is given by

$$P(\tilde{y} \mid y; \tilde{x}, X, \sigma^2, \Lambda) = \int P(\tilde{y} \mid \beta; \tilde{x}, \sigma^2)\, P(\beta \mid y; X, \sigma^2, \Lambda)\, d\beta.$$

This is now a univariate Gaussian,

$$\tilde{y} \mid y \sim N\left(\tilde{x}^T \mu_n, \sigma_n^2(\tilde{x})\right),$$

$$\sigma_n^2(\tilde{x}) = \sigma^2 + \tilde{x}^T \Sigma_n \tilde{x}.$$

The first term on the right is due to the noise (the
additive $\varepsilon$), and the second term is due to the
posterior variance of $\beta$, which represents our uncertainty
in the parameters.
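
The predictive mean and variance can be evaluated at any new input once $\mu_n$ and $\Sigma_n$ are available. The sketch below assumes a known noise variance and the prior $\Lambda = \lambda I$; the function name `posterior_predictive`, the simulated data, and the two query points are hypothetical choices for illustration.

```python
import numpy as np

def posterior_predictive(x_new, X, y, sigma2, Lambda):
    """Mean and variance of the Gaussian posterior predictive at a new input x_new."""
    A = X.T @ X + sigma2 * Lambda
    mu_n = np.linalg.solve(A, X.T @ y)
    Sigma_n = sigma2 * np.linalg.inv(A)
    mean = x_new @ mu_n                      # x~^T mu_n
    var = sigma2 + x_new @ Sigma_n @ x_new   # sigma^2 + x~^T Sigma_n x~
    return mean, var

rng = np.random.default_rng(3)
n, sigma2, lam = 40, 0.25, 1.0               # hypothetical settings
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 2.0]) + rng.normal(0.0, np.sqrt(sigma2), size=n)
Lambda = lam * np.eye(2)

# One point inside the observed range of x and one far outside it.
for x_val in (0.5, 5.0):
    x_new = np.array([1.0, x_val])           # leading 1 matches the intercept column
    mean, var = posterior_predictive(x_new, X, y, sigma2, Lambda)
    lo, hi = mean - 1.96 * np.sqrt(var), mean + 1.96 * np.sqrt(var)
    print(f"x={x_val}: mean={mean:.3f}, var={var:.3f}, 95% interval=({lo:.3f}, {hi:.3f})")
```

The predictive variance never falls below $\sigma^2$, and it grows as the query point moves away from the observed data, because the parameter-uncertainty term $\tilde{x}^T \Sigma_n \tilde{x}$ increases there.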

Interpretation: 

Bayesian statistics takes prior information into account when
the sample is set up. Through Bayes' formula it can update this
information according to the latest results, and it also yields
less error. So we conclude that Bayesian linear regression is
the most appropriate method for prediction.

Statistics, like data mining, is concerned with turning data
into information and knowledge, even though the terminology may
differ. In this project we presented a major statistical
approach used in data mining, namely Bayesian linear
regression. In practice, Bayesian linear regression is the most
advantageous method for calculating predictions in data mining.